Protein loops, solitons and side-chain visualization 
with applications to the left-handed helix region 
Folded proteins have a modular assembly. They are constructed from regular secondary structures
like α-helices and β-strands that are joined together by loops. Here we develop a visualization
technique that is adapted to describe this modular structure. As a complement to the widely employed
Ramachandran plot, which is based on toroidal geometry, our approach utilizes the geometry of a
two-sphere. Unlike the more conventional approaches that only describe a given peptide unit, ours is
capable of describing the entire backbone environment including the neighboring peptide units. It
maps the positions of each atom to the surface of the two-sphere exactly as these atoms are seen
by an observer who is located at the position of the central Cα atom. At each level of side-chain
atoms we observe a strong correlation between the positioning of the atom and the underlying local
secondary structure, with very little if any variation between the different amino acids. As a concrete
example we analyze the left-handed helix region of non-glycyl amino acids. This region corresponds
to an isolated and highly localized, residue independent sector in the direction of the Cβ carbons on
the two-sphere. We show that the residue independent localization extends to Cγ and Cδ carbons,
and to side-chain oxygen and nitrogen atoms in the case of asparagine and aspartic acid. When we
extend the analysis to the side-chain atoms of the neighboring residues, we observe that left-handed
β-turns display a regular and largely amino acid independent structure that can extend to seven
consecutive residues. This collective pattern is due to the presence of a backbone soliton. We show
how one can use our visualization techniques to analyze and classify the different solitons in terms
of selection rules that we describe in detail.



I. INTRODUCTION 

The Ramachandran plot [1], [2] is the paradigm technique of protein visualization. It describes
backbone atoms in a peptide group around a given Cα carbon in terms of dihedral rotations. The
Ramachandran plot can also be extended to the side-chain atoms in terms of the dihedral rotamers.
This gives rise to the Janin plot [5] and its variants. In the present article we develop new
visualization techniques to describe proteins. Our goal is to visualize all atoms both in a given
peptide unit and in the neighboring units, beyond the regime of the Ramachandran plot. This will
enable us to search for new relations between the positioning of various atoms and the backbone
geometry. Our approach draws from developments in three-dimensional visualization that have
taken place after the Ramachandran plot was originally introduced [3], [5]. In particular, in lieu of
toroidal geometry we utilize the geometry of a two-sphere. It enables us to describe the various
atoms exactly as they are seen by an observer who roller-coasts along the backbone. Of particular
interest to us is the visual analysis of the modular components of which all folded proteins are
built. These have been recently identified as the soliton solutions to a generalized discrete
nonlinear Schrödinger equation (DNLS) [6]-[8].

Soliton solutions to nonlinear difference equations share a long history with the biological
physics of proteins. The discrete version of the nonlinear Schrödinger equation is an embodiment
of this relationship. It was originally introduced by Davydov [9] to describe energy transfer
along the protein α-helices. Subsequently the DNLS equation has found many additional
applications in biological physics and elsewhere [10]. The DNLS equation also has the remarkable
mathematical property of integrability; it is commonly viewed as the archetype integrable system.

When the DNLS soliton propagates along the α-helix, the protein changes its shape. In [6], [7] it
has been shown that when the soliton becomes trapped, the protein folds. It now appears that
practically all folded proteins can be built in a modular fashion from a relatively small number
of such trapped solitons [5]. In the present article we combine the notion of a soliton with modern
visualization techniques [12]. We are particularly interested in the ramifications of the backbone
DNLS soliton in protein side-chain geometry. Our ultimate goal is to develop a graphical
characterization and eventually a full classification of protein structures in terms of their
soliton modules. As a prelude, we here utilize the soliton concept to visually inspect and analyze
those protein conformations that are located in the left-handed α-helix (L-α) region of the
Ramachandran plot [1], [2]. This region is a relatively small subset of all different protein
conformations, and as such amenable to an explicit analysis.

Of particular interest to us are the common geometric aspects of asparagine (ASN) and aspartic
acid (ASP). Asparagine is the predominant residue in the so-called non-glycyl L-α region.
According to the prevailing point of view this is due to a localized non-covalent attractive
carbonyl-carbonyl interaction between the side-chain and backbone [13]-[15]. Such a
carbonyl-carbonyl interaction can only be present in ASN, ASP, glutamine (GLN) and glutamic acid
(GLU). Indeed, the propensity of ASP, which is structurally very similar to ASN, is also clearly
amplified in the L-α region, while the somewhat lower propensity of GLN and GLU has been explained
in the literature as a consequence of steric suppressions.

Here we show that the presence of an L-α site goes beyond the regime of a single peptide unit. We
find that it involves a coordinated interplay of up to seven consecutive amino acids. We argue
that this extended correlation over several amino acids is symptomatic of solitons. We perform a
detailed visual investigation and propose a graphical classification of these solitons. We argue
that all protein structures could be characterized and classified similarly, in terms of general
selection rules that we formulate. We find that the continuous geometry of the two-sphere gives a
more perceptible characterization of protein conformations than the toroidal Ramachandran plot.
In fact, the three-dimensional visualization techniques we utilize have been largely introduced
and developed after the publication of [1], [2]. Our approach exploits the properties of a
piecewise linear framed chain, as it is being applied to visualization problems in aircraft and
robot kinematics, stereo reconstruction, and increasingly in computer graphics and virtual
reality [3], [5]. In these applications different framings correspond to different camera gaze
positions, which one introduces and varies for the purpose of extracting diverse and
complementary information on geometrical aspects and physical properties of the system under
investigation. However, largely due to the success and systematics provided by the Ramachandran
plot, thus far this kind of approach has been sparsely applied to the analysis of protein
conformations. Among our goals is to demonstrate that these modern visualization techniques can
provide a powerful complementary tool for the visual description of folded proteins. In
particular, they enable the study of visual correlations between nearby peptide units, which is
not possible in the Ramachandran approach that is limited to describing a single peptide unit only.

Finally, we note that the investigation of the physical properties of our concrete examples ASN
and ASP is also of substantial biological interest. These two amino acids are, more frequently
than any other amino acid, subject to in vivo post-translational modifications including
spontaneous nonenzymatic deamidation from ASN to ASP [16] and racemization from L-ASP into D-ASP
[17]. These processes are presumed to have consequences for cellular and organismal ageing [16],
[18]. They might also have a role in enhancing the emergence of amyloid based neurodegenerative
diseases [18], [19].



II. FRAMING 

We interpret a protein backbone in terms of a framed chain, with vertices located at the Cα
carbons [12]. Depending on the application, the framing can be introduced in various different
ways. Examples include the geometric Frenet frame [3], [5], the geodesic Bishop frame [20], and
the protein specific Cβ carbon frame that we obtain by utilizing the direction of the Cβ carbon
along a protein backbone to construct an orthonormal framing [12]. Here we propose that in
particular the Frenet framing provides a powerful tool for protein side-chain visualization, also
beyond our explicit example of the L-α Ramachandran region. The additional advantage of the Frenet
framing is that it relates directly to an energy function. But we also advertise the closely
related Cβ framing that may sometimes have certain visual advantages.

The framing of a piecewise linear chain is conventionally based on the Denavit-Hartenberg [21]
formalism. This formalism was originally introduced in robotics but has subsequently been
extensively applied in other disciplines as well. Here we resort to a variant that has been
developed in [12] for the purpose of framing protein backbones. It utilizes the transfer matrix
formalism [11] to describe a protein with N residues using the coordinates r_i of the backbone
Cα carbons (i = 1, ..., N). These coordinates can be downloaded from the Protein Data Bank
(PDB) [35]. For each of the segments that connect the backbone Cα central carbons we compute the
unit length tangent vector t_i, binormal vector b_i and normal vector n_i using

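$$
\mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{|\mathbf{r}_{i+1} - \mathbf{r}_i|}, \qquad
\mathbf{b}_i = \frac{\mathbf{t}_{i-1} \times \mathbf{t}_i}{|\mathbf{t}_{i-1} \times \mathbf{t}_i|}, \qquad
\mathbf{n}_i = \mathbf{b}_i \times \mathbf{t}_i \tag{1}
$$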

Thus the tangent vector t_i points from the i-th central Cα carbon in the direction of the
(i+1)-th central Cα carbon, the way it is seen by an observer who is located at the position of
the i-th carbon. The b_i and n_i determine a frame that enables the observer to orient herself at
the location r_i, on the plane that is orthogonal to the direction t_i. Together the right-handed
triplet (n_i, b_i, t_i) constitutes the orthonormal discrete Frenet frame for each residue along
the backbone chain, with base at the position of the vertex r_i. The corresponding backbone bond
κ_{i+1,i} ≡ κ_i and torsion τ_{i+1,i} ≡ τ_i angles can be computed from (1) as follows,

$$\cos\kappa_i = \mathbf{t}_{i+1} \cdot \mathbf{t}_i \tag{2}$$

$$\cos\tau_i = \mathbf{b}_{i+1} \cdot \mathbf{b}_i \tag{3}$$
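
The construction in (1)-(3) can be sketched numerically as follows. This is a minimal illustration, assuming NumPy and an (N, 3) array of Cα coordinates; the function names `discrete_frenet_frames` and `bond_and_torsion_angles` are hypothetical and are not taken from the original work. Since (3) only fixes the cosine of the torsion angle, the sketch returns the unsigned angle; a sign convention would have to be supplied separately.

```python
import numpy as np

def discrete_frenet_frames(r):
    """Discrete Frenet frames from C-alpha coordinates r of shape (N, 3).

    Returns t (N-1 rows, one per segment) and b, n (N-2 rows each, one per
    interior vertex, because b_i needs both t_{i-1} and t_i).
    """
    r = np.asarray(r, dtype=float)
    d = np.diff(r, axis=0)                                # r_{i+1} - r_i
    t = d / np.linalg.norm(d, axis=1, keepdims=True)      # tangents, eq. (1)
    c = np.cross(t[:-1], t[1:])                           # t_{i-1} x t_i
    b = c / np.linalg.norm(c, axis=1, keepdims=True)      # binormals, eq. (1)
    n = np.cross(b, t[1:])                                # n_i = b_i x t_i
    return t, b, n

def bond_and_torsion_angles(t, b):
    """Bond angles from eq. (2) and unsigned torsion angles from eq. (3)."""
    cos_kappa = np.einsum("ij,ij->i", t[1:], t[:-1])
    cos_tau = np.einsum("ij,ij->i", b[1:], b[:-1])
    kappa = np.arccos(np.clip(cos_kappa, -1.0, 1.0))
    tau = np.arccos(np.clip(cos_tau, -1.0, 1.0))          # sign not fixed by (3)
    return kappa, tau
```

For a chain with N vertices this yields N-2 bond angles and N-3 torsion angles, in line with the later remark that the bond angle involves three vertices and the torsion angle four.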

Alternatively, if the bond and torsion angles are known we can construct the frames iteratively by
starting from the N terminus and using [12]

$$
\begin{pmatrix} \mathbf{n}_{i+1} \\ \mathbf{b}_{i+1} \\ \mathbf{t}_{i+1} \end{pmatrix}
= \exp\{\kappa_i T^2\} \cdot \exp\{\tau_i T^3\}
\begin{pmatrix} \mathbf{n}_i \\ \mathbf{b}_i \\ \mathbf{t}_i \end{pmatrix} \tag{4}
$$

where the T^a (a = 1, 2, 3) are the adjoint SO(3) Lie algebra generators. Once the t_i have been
constructed from (4) and the bond lengths s_i = |r_{i+1} − r_i| have been determined we recover
the entire backbone from

$$\mathbf{r}_k = \sum_{i=0}^{k-1} s_i \, \mathbf{t}_i \tag{5}$$
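
The iterative reconstruction can be illustrated with a short sketch. It assumes NumPy and does not evaluate the matrix exponentials in (4); instead it uses an equivalent geometric step that is consistent with the definitions (1)-(3): the binormal is rotated about the current tangent by the torsion angle, and the tangent is then bent by the bond angle in the plane orthogonal to the new binormal. The sign convention for the torsion angle (sin τ_i = −b_{i+1} · n_i) and the function name `rebuild_backbone` are assumptions made for this illustration only.

```python
import numpy as np

def rebuild_backbone(kappa, tau, s, r0, frame0):
    """Rebuild vertices from link angles, segment lengths and an initial frame.

    kappa[i], tau[i] : bond angle and signed torsion angle linking frame i to i+1
    s[i]             : length of segment i (s has one more entry than kappa)
    r0               : position of the first vertex
    frame0           : initial right-handed frame (n, b, t)
    """
    n, b, t = (np.asarray(v, dtype=float) for v in frame0)
    r = [np.asarray(r0, dtype=float)]
    r.append(r[-1] + s[0] * t)
    for k, ta, length in zip(kappa, tau, s[1:]):
        # rotate the binormal about t by tau (assumed sign convention)
        b_next = np.cos(ta) * b - np.sin(ta) * n
        # unit vector orthogonal to both t and the new binormal
        m = np.cross(b_next, t)
        # bend the tangent by kappa, so that t x t_next is parallel to b_next
        t_next = np.cos(k) * t + np.sin(k) * m
        n, b, t = np.cross(b_next, t_next), b_next, t_next
        r.append(r[-1] + length * t)       # cumulative sum of eq. (5)
    return np.array(r)
```

Feeding in angles and lengths measured from a known chain, with the torsion sign taken from the same convention, should return the original vertices up to rounding, which gives a simple consistency check of the frame conventions.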

The set of all Frenet frames defines a framing of the backbone. According to (2)-(4) the bond and
torsion angles are link variables; they relate a frame at the vertex r_i to a frame at the vertex
r_{i+1}. We note that the definition of the bond angle involves three vertices while the
definition of the torsion angle involves a total of four vertices.

The Cβ framing [12] is a complement to the Frenet framing. It can be introduced for all non-glycyl
residues. We define these frames similarly, in terms of three mutually orthogonal unit vectors at
each Cα carbon. Consequently the Frenet framing and the Cβ framing are related to each other by
peptide unit dependent SO(3) rotations. The first unit vector of the Cβ basis is obtained as
follows,

$$\mathbf{s}_i = \frac{\mathbf{r}_{\beta,i} - \mathbf{r}_{\alpha,i}}{|\mathbf{r}_{\beta,i} - \mathbf{r}_{\alpha,i}|}$$

Here r_{α,i} is the location of the i-th Cα atom, and r_{β,i} is the location of the
corresponding Cβ atom. The second unit vector is

$$\mathbf{p}_i = \frac{\mathbf{s}_i \times \mathbf{t}_i}{|\mathbf{s}_i \times \mathbf{t}_i|}$$

where t_i is the Frenet frame unit tangent vector. Finally, the third unit vector in the Cβ frame
is

$$\mathbf{q}_i = \mathbf{s}_i \times \mathbf{p}_i$$
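
As a minimal sketch of the three definitions above, assuming NumPy and that the Cα position, the Cβ position and the Frenet tangent of a residue are available as 3-vectors (the function name `cbeta_frame` is hypothetical):

```python
import numpy as np

def cbeta_frame(r_alpha, r_beta, t):
    """C-beta frame (s, p, q) at one residue.

    r_alpha, r_beta : positions of the C-alpha and C-beta atoms
    t               : unit Frenet tangent at the same C-alpha
    """
    d = np.asarray(r_beta, dtype=float) - np.asarray(r_alpha, dtype=float)
    s = d / np.linalg.norm(d)              # points from C-alpha to C-beta
    c = np.cross(s, np.asarray(t, dtype=float))
    p = c / np.linalg.norm(c)              # orthogonal to both s and t
    q = np.cross(s, p)                     # completes the right-handed triplet
    return s, p, q
```

The frame is undefined when s and t are parallel, which the sketch does not guard against; for glycine there is no Cβ atom and hence no such frame.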

Since (s_i, p_i, q_i) is an orthonormal frame located at each Cα, it can be used like the Frenet
frame to visualize the various atoms along the protein backbone. Moreover, since

$$\mathbf{t}_i = \mathbf{p}_i \times \mathbf{s}_i$$

we can likewise use the Cβ framing to construct the entire backbone using (5).



We wish to employ the various frames together with the discrete Frenet equation (4) to inspect
the structure of folded proteins. As our principal data set we utilize all those proteins that
are presently in PDB and have an overall resolution that is better than 2.0 Å. We introduce no
additional curation or data pruning in this set. But we have confirmed that all our results and
conclusions stand when we restrict ourselves to those proteins with resolution better than 1.5 Å
and with less than 30% homology relation, or to those proteins that have a resolution which is
better than 1.0 Å. Finally, as a control set we also utilize the highly curated version v3.3
Library of chopped PDB files for representative CATH domains [23]. Since the conclusions we draw
are independent of the data set that we use, we only describe explicitly the results for the
first one, as it allows for the visually most complete presentation.



III. BACKBONE MAPPING 

We start by describing how to visualize the protein backbone in terms of the Frenet frames [12].
Here we go beyond the regime of the Ramachandran plot, which does not provide any direct visual
correlation between neighboring peptide groups. We introduce an observer who maps all the atoms
in the protein by traversing along the backbone. The observer moves between the Cα carbons like
on a roller-coaster, with an orientation that is determined by the discrete Frenet framing: We
take the base of the tangent vector t_i defined in (1) to be at the location r_i of the i-th
central Cα carbon. The tip of t_i then determines a point on the surface of a unit two-sphere
that surrounds our observer at the location of this Cα carbon. The observer uses this two-sphere
to construct a map of the various atoms exactly the way she sees them on the surface of the
sphere, as if the atoms were stars in the sky. For this she always orients the two-sphere at the
site i so that the north pole coincides with the tip of t_i, i.e. the north pole is always in the
direction of the next Cα at the site r_{i+1}. She takes the bond angle to measure the latitude of
the two-sphere from its north pole. The torsion angle measures the longitude starting from the
great circle that passes both through the north pole and through the tip of the binormal vector
b_i. In terms of these angles she can characterize the direction of the vector t_{i+1}, i.e. the
direction towards the site r_{i+2} to which the roller coaster turns at the next Cα carbon.
Consequently she acquires information about the geometric relations between neighboring peptide
units, and this goes beyond the regime of the Ramachandran plot. She proceeds as follows:

She first translates the center of the two-sphere from the location of the i-th central carbon
towards its north pole and all the way to the location of the (i+1)-th central carbon, without
introducing any rotation of the sphere. She then records the direction of t_{i+1} as a point on
the surface of the two-sphere. This defines the corresponding coordinates (κ_i, τ_i) and marks a
point on the map. It gives an instruction to the observer at the point r_i, how she should turn
at site r_{i+1}, to reach the (i+2)-th central Cα carbon at the point r_{i+2}.

She then continues to construct the mapping with the next Cα carbon along the backbone. She
rotates the two-sphere at r_{i+1} so that the north pole of the rotated sphere coincides with the
tip of t_{i+1}, and so that the torsion angle measures the longitude from the great circle
determined by the north pole and the tip of b_{i+1}. She repeats the procedure for all Cα, until
she has mapped the entire backbone. We note that for a folded protein the two vectors t_i and
t_{i+1} are never exactly parallel to each other, so there is never any ambiguity due to an
inflection point.

When we repeat this mapping procedure for every Cα in all proteins in our data set, we obtain a
(κ, τ) distribution that characterizes the overall geometry of protein backbones. This provides
non-local information on the backbone geometry that extends over several peptide units. In
particular, we now have a map that shows exactly how the central carbons are seen by our
roller-coasting observer when she gazes at them from her Frenet frame positions along the
backbone.

We find that the Cα distribution for all proteins in our data set determines an annulus on the
surface of the two-sphere. For visualization it then becomes convenient to employ the geometry of
the stereographically projected two-sphere. It is obtained by projecting our (κ, τ) coordinates
to the north pole tangent plane of the two-sphere. If (x, y) are the coordinates of this tangent
plane the projection is defined by

$$x + iy = \tan\left(\frac{\kappa}{2}\right) e^{-i\tau} \tag{6}$$
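
A minimal numerical sketch of the projection (6), assuming NumPy and angles given in radians. The sign of the exponent is taken as it reads in (6); flipping it only mirrors the map about the x axis. The function name `stereographic` is hypothetical.

```python
import numpy as np

def stereographic(kappa, tau):
    """Map latitude kappa (measured from the north pole) and longitude tau
    to plane coordinates via x + iy = tan(kappa / 2) * exp(-i * tau)."""
    z = np.tan(np.asarray(kappa, dtype=float) / 2.0) * np.exp(-1j * np.asarray(tau, dtype=float))
    return z.real, z.imag
```

The north pole κ = 0 lands at the origin, the equator κ = π/2 on the unit circle, and the south pole κ = π is sent to infinity, so the sketch should not be evaluated at κ = π.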

When we perform this projection for all Cα carbons in all proteins that are in our data set and
separately display the results for the different groups of α-helices, β-strands, 3/10-helices and
loops as these structures are defined in PDB, we arrive at the angular distributions that we show
in Figure 1. For our observer, who always fixes her gaze position towards the north pole of the
surrounding two-sphere at each Cα carbon, i.e. towards the black dot at the center of the
annulus, the color intensity reveals the likely direction to which the roller coaster who is
located at position r_i turns at the next Cα carbon, when she starts moving from its location at
r_{i+1} towards r_{i+2}. In particular, the four maps in Figure 1 are in a direct visual
correspondence with the way the Frenet frame observer perceives the backbone geometry.

The four maps in Figure 1 portray non-local features that are not available in conventional
Ramachandran plots. Moreover, instead of a discontinuous toroidal square as in the case of the
Ramachandran plots, the predominant feature in all of the present maps is that the PDB data is
concentrated in a continuous annulus which lies roughly between the circles κ_in ≈ 1 and
κ_out ≈ π/2. The exterior of the annulus κ > κ_out is an excluded region; it describes
conformations that are subject to steric




FIG. 1. (Color online:) The four major protein structures: a) α-helices, b) β-strands,
c) 3/10-helices and d) loops. We define these structures according to their PDB classification in
our 2.0 Å data set. In each Figure the center of the annulus is the north pole of the two-sphere
that surrounds the observer at the position i. This is the direction where the next Cα is
located. At this point the bond (latitude) angle κ = 0. The bond angle measures distance from the
center of the annulus so that the south pole, where κ = π, corresponds to points at infinity on
the plane. The torsion, i.e. longitude, angle τ ∈ [−π, π] increases by 2π when we go around the
center of the annulus in the counter-clockwise direction. The color coding in all our Figures,
increasing from white to blue to green to yellow to red, describes the relative number of
conformations in PDB on a log-squared scale. The intensity is proportional to the probability of
the direction where the observer turns at the next Cα carbon.

clashes. The interior κ < κ_in is sterically allowed but practically excluded as long as proteins
remain in the collapsed phase; the interior region becomes occupied when we cross the Θ-point and
proteins assume their unfolded conformations.

We notice that loops appear to have a slightly higher tendency to bend towards the left, i.e.
τ < 0. We also note that in the Figures for α, β and 3/10 the blue regions correspond to residues
where the present hydrogen-bond based PDB classification is in a mismatch with the geometric
structure that is commonly associated with these configurations. Moreover, the Figure reveals
that the PDB data displays hints of various underlying reflection symmetries: In Figure 1d
(loops) there is a clearly visible mirror of the standard right-handed α-helix region, located in
the vicinity of the outer rim with κ ≈ 3/2 and with torsion angle close to the value τ ≈ −2π/3.
A helix in this regime would be left-handed and tighter than the standard α-helix. There is also
a clear mirror structure in Figure 1b for β-strands: the standard region is (κ, τ) ≈ (1, π) and
its less populated mirror is located around (κ, τ) ≈ (1, 0). The mirror symmetry between the
ensuing extended regions persists in Figure 1d for loops. Finally, in Figure 1d we observe a
small elevated (yellow) region in the vicinity of (κ, τ) ≈ (3/2, −π/3). This is the region of
helices that are spatial left-handed mirror images of the standard α-helices. There is also a
(slightly) elevated (green) mirror of this region around (κ, τ) ≈ (3/2, 2π/3). This is like the
(κ, τ) ≈ (3/2, −2π/3) mirror of the standard right-handed α-helices.



IV. SIDE-CHAIN MAPPING 

We can similarly visualize the geometry of side-chain 
atoms, as they are seen by our roller-coasting observer. 
This gives us local information on the given peptide unit. 
Now the results turn out to be isomorphic to those re- 
vealed by the standard Ramachandran plot. Moreover, 
this enables us to develop a visual complement to the 
existing rotamer libraries. 

We assume that the observer is oriented according to the discrete Frenet framing that is
determined by the transfer matrix (4) at each Cα. At the location of the Cα the observer then
looks at the side-chain atoms and records the direction of each of them as points on the surface
of the two-sphere that surrounds the observer, with the north pole of the sphere always
coinciding with the direction towards the next Cα exactly as in the case of the backbone.
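
The recording step can be sketched as follows, assuming NumPy, an atom position, the observer's Cα position and the local frame (n, b, t). The latitude is measured from the north pole t as in the text. The choice of the zero meridian (longitude measured from n towards b) is an assumption of this sketch, as is the function name `sphere_coordinates`.

```python
import numpy as np

def sphere_coordinates(atom, r_alpha, frame):
    """Direction of an atom as seen from the C-alpha at r_alpha.

    frame : right-handed orthonormal triplet (n, b, t); t is the north pole.
    Returns (latitude, longitude) in radians, latitude in [0, pi].
    """
    n, b, t = (np.asarray(v, dtype=float) for v in frame)
    u = np.asarray(atom, dtype=float) - np.asarray(r_alpha, dtype=float)
    u = u / np.linalg.norm(u)                      # only the direction is recorded
    latitude = np.arccos(np.clip(np.dot(u, t), -1.0, 1.0))
    longitude = np.arctan2(np.dot(u, b), np.dot(u, n))   # assumed zero meridian along n
    return latitude, longitude
```

Because the vector is normalized, atoms at different distances along the same line of sight map to the same point, just as stars at different distances share a position in the sky. At the poles the longitude is not meaningful.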

In Figure 2 (top) we display the angular distribution of the Cβ carbons on the surface of the
two-sphere for all the Cα carbons, as recorded by our Frenet frame observer who is located at the
origin of the sphere. Recall that a Cβ carbon is present in all non-glycyl residues. We note that
our framing is determined entirely in terms of the backbone. According to the prevailing paradigm
the directions of the Cβ carbons should then be directly computable from the geometry of the
tetrahedral covalent bond structure of the pertinent Cα carbon. However, Figure 2 (top) reveals
that the directions of the Cβ carbons are not determined only by the local covalent bond
structure. In addition, these directions are clearly subject to secondary structure dependent but
amino acid independent nutations. This confirms that at the level of accuracy of our data, the
stereochemical restraints fail to be fully universal. They reflect the secondary structure
environment [23]. In fact, despite being based entirely on the Cβ atoms, Figure 2 is fully
isomorphic to the standard Ramachandran plot, for all amino acids except for glycine that has no
Cβ.

An important feature of the nutation is the presence of the highly localized, isolated island
denoted L-α that is clearly visible in Figure 2 (top). We have confirmed that this isolated
island coincides exactly with the conventional non-glycyl L-α region of the standard Ramachandran
plot. This is shown in Figure 2 (middle) where we display the direction of the Cβ carbons solely
for those non-glycyl residues that are in the L-α Ramachandran region. Finally, in Figure 2
(bottom) we display the discrete Frenet frame distribution of the Cβ carbons for those ASN that
are located in loops only, according to PDB classification. The relatively high propensity of ASN
in the L-α island is prominent.

In the sequel we shall concentrate our attention solely on the isolated L-α island in Figure 2
(middle). We start by noting the propensity of different amino acids in the L-α island. The
result (in percent) is shown in Figure 3. This Figure confirms the high propensity of ASN (N)
that is also visible from Figure 2 (bottom). We find that ASP (D) also has a relatively high
propensity. But the propensity of histidine (H) is practically equal. Furthermore, several
non-carbonylic amino acids have a higher propensity than GLU (E). Finally, the β-branched
isoleucine (I), valine (V) and threonine (T) all have clearly suppressed propensities and proline
(P) is practically absent, presumably reflecting the presence of steric constraints [13], [15].

We now proceed to map the directions of the Cγ carbons for those side-chains where Cβ is located
in the L-α island of Figure 2. We continue to utilize the framing determined by our observer who
roller-coasts the Cα backbone with orientation determined by the discrete Frenet frames, and
north pole always in the direction of the next Cα. The result is presented in Figure 4. It
reveals that at the level of Cγ, the single L-α island of the Cβ becomes divided into two
separate but still highly localized islands. This reflects the sp3 hybridization of the Cβ: There
is a putative gauche− (g−) island where around 70% of the residues in the L-α island are located,
and a putative trans island for the rest. Interestingly, we do not really see any putative
gauche+ (g+) island.

The amino acid propensities of these two islands are displayed in Figure 5. ASN is the most
populous in both Cγ islands. However, the propensity of ASP is elevated

FIG. 2. (Color online:) The directions of the Cβ carbons, as seen by our Frenet frame observer
who is located at the corresponding Cα carbon which is at the center of the sphere. The vector t
points in the direction of the next Cα carbon. On top, all residues in our data set including
ASN. In the middle we display only the L-α region of the Ramachandran plot. On bottom we display
only those ASN that are in a loop in PDB classification.

FIG. 3. (Color online:) The percent distribution of non-glycyl residues in the L-α island of
Figure 2. In the top Figure we display the result for all amino acids in our entire data set, and
in the bottom Figure for those in our data set that are classified as loops in PDB. The
propensity of carbonylic ASN (N) is clearly enhanced in both cases. But in both cases the
similarly carbonylic ASP (D) has about the same percent-wise propensity as the non-carbonylic
HIS (H), and the carbonylic GLU (E) is relatively quite suppressed.



only in the trans island. In the g− island both non-carbonylic HIS (H) and LYS (K) and even the
carbonylic GLN (Q) have a higher propensity than ASP. At the moment we have no good explanation
for this observation, and we leave it as a challenge. In Figure 6 we plot the percentage ratios
of the different amino acids as they appear in the two Cγ islands. We note that around 43% of
residues in the putative g− island are non-carbonylic, while in the putative trans island the
number is much lower, close to 12%.

We proceed to the next level along the side-chain, to map the Cδ carbons. In Figure 7 we plot
these carbons for those side-chains where Cβ is located in the L-α island. In Figure 7 (top) we
show them as they are seen by our discrete Frenet frame observer who sits at the locations of the
Cα carbons. In Figure 7 (bottom) we show them as they are seen in the Cβ frame by an observer now
sitting at the Cβ location, this time using the stereographic projection. Since ASN (and ASP) has
no Cδ carbon, we display instead the direction of the side-chain O atom for ASN; the result is
shown in Figure 8. In Figure 8 (top) we use the Cα based Frenet frame observer and in Figure 8
(middle) and (bottom) we use the Cβ frame observer in combination with stereographic projection.

FIG. 5. (Color online:) The percent-wise propensity of different amino acids in the putative g−
island (top) and trans Cγ island (bottom) in Figure 4.

FIG. 6. (Color online:) The relative number of different amino acids in the putative g− (top)
and trans (bottom) Cγ islands.

From Figure 7 we observe that the directions of the Cδ continue to be highly localized,
independently of the type of amino acid. But unlike in the case of Cγ, we find, quite
surprisingly, that now there is only one clearly visible island. We do not have any definite
stereochemistry or physics based explanation why the clearly visible sp3 hybridization based
doubling that we observe at the level of Cγ has now completely disappeared. However, we do
observe the formation of a second, relatively very weakly occupied island at larger values of the
latitude angle and with longitude angle χ ≈ −2π/3. This island is clearly visible in Figure 7
(right). There is also a third, very faint island in the direction χ ≈ f that (barely) becomes
visible in the stereographically projected Figure 7 (left). At the moment we do not have a basis
to conclude whether the extremely low population of the second and third island is a real effect
or only a reflection of problems in the experimental data. According to [29], there are presently
an estimated half a million incorrectly positioned side-chain atoms in PDB data. In this light,
the reason for the sparse population of the two






FIG. 7. (Color online:) The directions of the Cδ carbons in the discrete Frenet frame of the Cα
carbons on the surrounding two-sphere (top) and in the Cβ frame (bottom). In the Figure on top,
the Cβ atom is located at the origin of the two-sphere, which is stereographically projected from
the north pole (with Cα at the south pole).



additional sp3 hybridized islands should be subjected to experimental scrutiny, to determine the
cause.

Since ASN has no Cδ carbon, in Figure 8 we display instead the O atoms of the ASN side-chain
according to PDB identification. In Figure 8 (top) we use the discrete Frenet frame of the Cα
carbons, and in the middle and bottom we use the stereographically projected Cβ frame. We note
that the two Cγ islands appear to become divided into four distinct but still highly localized
islands. However, we recall that the identification between the ASN side-chain O and N can be
very difficult, and there are apparently numerous errors in the O and N identifications in PDB
data [29]. Thus we have displayed in Figure 8 (bottom) the N atoms according to PDB
identification as well. By comparing Figures 8 (middle) and (bottom) we propose that most likely
the two inner-most islands denoted a and b in Figure 8 describe N instead of





FIG. 8. (Color online:) On top, the directions of the side-chain O atoms of ASN in the discrete
Frenet frame of the Cα carbons. In the middle and bottom we use the same stereographically
projected Cβ frames as in Figure 7 (right). The Figure in the middle displays the same atoms as
the Figure on top. On bottom, the directions of the side-chain N atoms of ASN suggest that the
correct identification of the a and b regions in the top and middle figures should be N and not O
as in PDB. Similarly, the regions t and g− in the bottom should be O and not N as in PDB. See
[29]. (t is trans and g− is gauche−.)
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O atoms. If so, our visualization technique could become 
a useful tool in detecting erroneously identified O and N 
atoms and help to resolve the kind of issues raised in . 
This could be scrutinized by a careful re-analysis of high 
resolution x-ray crystallography data. 

Finally, in Figure 9 we have mapped the locations of
the Cγ atoms in our entire dataset (except for prolines)
as they are seen in the stereographically projected Cβ
frame. In the Figure at top we show all atoms except
those for which the Cβ is in the L-α island, and in the
Figure at bottom we show only the L-α atoms. The sp3
hybridization of the Cβ covalent bond structure is clearly
visible. Furthermore, in each of the three regions in the top
Figure we recognize the substructures that correspond to the
α-helices, β-strands and the interconnecting loops. Each of
the three regions is then isomorphic to Figure 2a.

Obviously, it is straightforward to continue the present
analysis to inspect additional side-chain atoms. However,
our goal here is not to perform a detailed and complete
analysis of all the side-chain atoms; we simply aim to
describe a method.



V. SOLITONS 

The localization we have observed in the L-α side-chain
atoms suggests that there is an organizational principle
in the side-chain orientations that extends beyond a single
peptide unit. Hence it cannot be detected by the
Ramachandran plot or in terms of the standard rotamer
libraries, since these only provide information on a given
peptide unit. The backbone Cα atoms we have inspected all
correspond to the L-α position of the Cβ; this region is
known to appear commonly in connection with loops rather
than regular secondary structures. Thus the order we have
observed is a priori not a reflection of any apparently regular
secondary structure category at the level of the backbone
geometry, but a characteristic of loops. We propose
that it is due to the presence of a soliton solution to a
discrete version of the nonlinear Schrödinger (DNLS) equation
that universally describes the backbone Cα geometry in
(practically) all folded proteins.

FIG. 9. (Color online) The directions of the Cγ atoms
in the Cβ centered frames, on the two-sphere stereographically
projected from its north pole and with Cα at the south pole. On
the top all those Cγ atoms for which the Cβ is not in the L-α
island, and on bottom those that correspond to Cβ in the L-α island.
(gauche- on left, t on top-right and gauche+ on bottom-right.)

For the soliton description we do not need to know
the atomic level details of the energy function. We only
need to apply general symmetry principles to the abstract
full quantum mechanical, all-atom Hamiltonian operator
$H[q_i, p_i]$. Here the index $i = 1, \dots, N$ labels all pairs of
canonical coordinates $(q_i, p_i)$ that describe the elementary
constituents. These include the individual C, O, N,
H and every other atom in the protein and in the solvent.
We also account for the valence electrons, and for every
local and long range interaction between all the atoms
both in the protein and in the solvent. For simplicity we
take all the variables to be point-like, that is we work at
the first quantized level. The canonical partition function
is computed by the path integral,

$$
Z = \mathrm{Tr}\, e^{-\beta H} = \int [dq]\, e^{-S(q, \dot q)}
$$

where $S(q, \dot q)$ is the classical Euclidean action of $H[q_i, p_i]$.
The integration extends over all periodic configurations
(anti-periodic in the case of fermions). With the partition
function we obtain the thermodynamical Helmholtz free
energy as follows: We introduce external sources $j_i(t)$ and
extend the partition function into the generating functional
of connected Green's functions,

$$
W[j] = \ln \int [dq]\, e^{-S(q, \dot q) + \int j_i q_i}
$$

We introduce the Legendre transformation

$$
\Gamma[q] = \sum_i \int j_i q_i - W[j], \qquad q_i = \frac{\delta W}{\delta j_i}
$$

This defines the effective action that coincides with the
Helmholtz free energy when we take the limit $j_i \to 0$.

There are various methods to compute the Helmholtz
free energy E. Here we introduce a finite difference version
of the gradient expansion. In the leading nontrivial
order we have

$$
E = \lim_{j \to 0} \Gamma[q] = \;\cdots \tag{7}
$$



The potentials are all local and the coefficients $F$ are bi-local.
The higher order terms in the expansion are either higher
order polynomials in the nearest neighbor variables, or
terms that introduce couplings between next-to-nearest
neighbor variables.

In the expansion (7), in the case of the backbone we
identify the generalized coordinates $q_i$ with the bond and
torsion angles $(\kappa_i, \tau_i)$ in (2), (3). We assume that all the
additional variables that appear in the full Hamiltonian
operator $H[q_i, p_i]$ have been integrated over in constructing
the partition function. They affect the detailed functional
form of the coefficients $F$ etc. in (7). Since
(5) contains only the $\mathbf{t}_i$, it is clear that the expansion (7)
in terms of $(\kappa_i, \tau_i)$ must remain invariant if we introduce
a local frame rotation in the normal plane spanned by
$(\mathbf{n}_i, \mathbf{b}_i)$. This introduces a strong constraint on its
functional form. In [30], [31] this has been utilized to show
that in the leading order the expansion (7) is uniquely
determined. It can only contain the following terms [5]- [5],

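$$
E = -\sum_{i=1}^{N-1} 2\,\kappa_{i+1}\kappa_i
  + \sum_{i=1}^{N} \left\{ 2\kappa_i^2 + q\,(\kappa_i^2 - m^2)^2
  + \frac{d_\tau}{2}\,\kappa_i^2\tau_i^2 - b_\tau\,\kappa_i^2\tau_i
  - a_\tau\,\tau_i + \frac{c_\tau}{2}\,\tau_i^2 \right\} \tag{8}
$$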



Here the first sum together with the first three terms in
the second sum coincide with the integrable energy function
of the conventional DNLS equation with a potential
that displays spontaneous symmetry breaking. The
fourth ($b_\tau$) and the fifth ($a_\tau$) terms are the only two lower
order nontrivial conserved quantities that appear in the
integrable DNLS hierarchy prior to the energy. These are
the momentum and the helicity, respectively. The last
($c_\tau$) term is the standard Proca mass term. The parameters
are all global and specific only to a super-secondary
structure such as helix-loop-helix. In particular they are
independent of the detailed nature of amino acids.
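
The structure of the energy described above can be made concrete with a small numerical sketch. It assumes the form E = -Σ 2κ_{i+1}κ_i + Σ {2κ_i² + q(κ_i² - m²)² + (d/2)κ_i²τ_i² - bκ_i²τ_i - aτ_i + (c/2)τ_i²}, which is what the term-by-term description suggests; the function name and all parameter values are hypothetical and chosen only for illustration. The check at the end shows that, for fixed bond angles, the torsion angles τ_i = (a + bκ_i²)/(c + dκ_i²) give a lower energy than a slightly shifted set.

```python
import numpy as np


def backbone_energy(kappa, tau, q, m, a, b, c, d):
    """Evaluate the assumed form of the energy (8) for a chain of N sites."""
    kappa = np.asarray(kappa, dtype=float)
    tau = np.asarray(tau, dtype=float)
    # Nearest-neighbour coupling, summed over i = 1 .. N-1.
    coupling = -2.0 * np.sum(kappa[1:] * kappa[:-1])
    k2 = kappa ** 2
    # Local terms, summed over i = 1 .. N.
    local = (
        2.0 * k2
        + q * (k2 - m ** 2) ** 2      # symmetry breaking potential
        + 0.5 * d * k2 * tau ** 2     # kappa-tau coupling
        - b * k2 * tau                # "momentum" term
        - a * tau                     # "helicity" term
        + 0.5 * c * tau ** 2          # Proca mass term
    )
    return coupling + np.sum(local)


# Hypothetical parameter values and bond angles, for illustration only.
params = dict(q=1.0, m=1.5, a=-1.0, b=0.1, c=1.0, d=0.2)
kappa = np.array([1.5, 1.5, 1.0, -1.5, -1.5])

# Torsion angles that make the energy stationary for fixed kappa.
tau_star = (params["a"] + params["b"] * kappa ** 2) / (
    params["c"] + params["d"] * kappa ** 2
)
e_star = backbone_energy(kappa, tau_star, **params)
e_shifted = backbone_energy(kappa, tau_star + 0.05, **params)
```

Because the energy is quadratic in each τ_i with a positive coefficient (c + dκ_i²)/2 for these values, any shift away from `tau_star` raises it.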

Unlike a force field in molecular dynamics, the energy
function (8) does not describe the fine details of the atomic
level interactions such as Coulomb, van der Waals,
hydrogen bonding etc. Instead, like an effective
Landau-Lifschitz theory it describes the properties of a folded
protein backbone in terms of universal physical arguments.



Full details and motivation of (8) are presented in [6]-[8].

The remarkable property of (8) is that the torsion angle
$\tau_i$ is only subject to local interactions; all explicit
non-local interactions are carried by the bond angle $\kappa_i$.
Furthermore, since $\tau_i$ appears at most quadratically we
can solve for it in terms of $\kappa_i$,



$$
\tau_i[\kappa] = \frac{a_\tau + b_\tau \kappa_i^2}{c_\tau + d_\tau \kappa_i^2} \tag{9}
$$



When we substitute this into the variational equation of
$\kappa_i$ that follows from (8), we arrive at a generalized version
of the DNLS equation. Its soliton solution has been
constructed in [B]-[5]. In particular, it has been observed that
the soliton can be approximated by the discretized version
of the dark soliton solution of the continuum NLSE
[9]-pT].



$$
\kappa_i = \frac{m_1\, e^{c_1 (i-s)} - m_2\, e^{-c_2 (i-s)}}{e^{c_1 (i-s)} + e^{-c_2 (i-s)}} \tag{10}
$$



Here the various parameters each have a natural interpretation,
see [B]-j5] for a detailed description: The parameter
$s$ determines the backbone site location of the center
of the fundamental loop that is described by the soliton.
The values of the parameters $m_{1,2} \in [0, \pi] \bmod 2\pi$ are
entirely determined by the bond angles of the adjacent
helices and strands. Finally, only $c_1$ and $c_2$ are intrinsically
loop specific parameters; they specify the length
of the loop. The soliton profile of $\kappa_i$ determines the torsion
angles $\tau_i$ by (9). According to [S] practically all PDB
proteins can be constructed as the sum of terms of the
form (10), in a modular fashion from a relatively small
number of soliton profiles.
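
As an illustration of how the profile parameters act, the following sketch evaluates a dark-soliton bond angle profile and the torsion angles it induces. It assumes the profile κ_i = (m1 e^{c1(i-s)} - m2 e^{-c2(i-s)}) / (e^{c1(i-s)} + e^{-c2(i-s)}) and the relation τ_i = (a + bκ_i²)/(c + dκ_i²); the function names and the numerical values are hypothetical and are not fitted to any protein.

```python
import numpy as np


def kappa_profile(sites, s, m1, m2, c1, c2):
    """Bond angle profile of the assumed dark-soliton form, centered at site s."""
    x = np.asarray(sites, dtype=float) - s
    grow = np.exp(c1 * x)
    decay = np.exp(-c2 * x)
    return (m1 * grow - m2 * decay) / (grow + decay)


def tau_from_kappa(kappa, a, b, c, d):
    """Torsion angles determined by the bond angles, tau = (a + b k^2)/(c + d k^2)."""
    k2 = np.asarray(kappa, dtype=float) ** 2
    return (a + b * k2) / (c + d * k2)


# Hypothetical loop centered at site 5 of an 11-site segment.
sites = np.arange(11)
kappa = kappa_profile(sites, s=5, m1=1.5, m2=1.5, c1=0.8, c2=0.8)
tau = tau_from_kappa(kappa, a=-1.0, b=0.1, c=1.0, d=0.2)
```

In the symmetric case m1 = m2 = m and c1 = c2 = c used here the profile reduces to m·tanh(c(i - s)): far from the center the bond angle approaches the constant values ±m of the adjacent regular structures, and c sets how many sites the transition occupies.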

Following [2E] we argue that in the Frenet frames, the
angular positions of the side-chain atoms can be similarly
determined in terms of the corresponding $\kappa_i$ values only.
For this we denote by $(\theta, \varphi)$ the standard spherical
latitude and longitude angles of the sphere that surrounds
the Cα observer. We propose that to leading order in
the expansion (7), in these coordinates each of the
side-chain atoms has an energy function that has the same
functional form as the energy function of the backbone
torsion angles. Consequently, for each side-chain atom we
introduce only the following two leading contributions to
the energy



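$$
E_\theta = \sum_{i=1}^{N} \left\{ \frac{d_\theta}{2}\,\kappa_i^2\theta_i^2 - b_\theta\,\kappa_i^2\theta_i - a_\theta\,\theta_i + \frac{c_\theta}{2}\,\theta_i^2 \right\} \tag{11}
$$

$$
E_\varphi = \sum_{i=1}^{N} \left\{ \frac{d_\varphi}{2}\,\kappa_i^2\varphi_i^2 - b_\varphi\,\kappa_i^2\varphi_i - a_\varphi\,\varphi_i + \frac{c_\varphi}{2}\,\varphi_i^2 \right\} \tag{12}
$$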



Note that these contributions have been carefully selected
so that they will not change the functional form of (8).
The addition of (11), (12) will only redefine the coefficients
in the $\kappa_i$ dependent terms in (8), which does not
lead to any change in the underlying soliton structure.
In particular, we can still utilize the approximative soliton
profile (10). In parallel with (9) the spherical angles
$(\theta_i, \varphi_i)$ for each of the side-chain atoms are then dynamically
determined by the DNLS soliton profile of the backbone
bond angles $\kappa_i$,



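$$
\theta_i[\kappa] = \frac{a_\theta + b_\theta \kappa_i^2}{c_\theta + d_\theta \kappa_i^2}, \qquad
\varphi_i[\kappa] = \frac{a_\varphi + b_\varphi \kappa_i^2}{c_\varphi + d_\varphi \kappa_i^2}
$$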



The present visual analysis implies that in the case of
an L-α residue the numerical values of both $(b_\theta, d_\theta)$ and
$(b_\varphi, d_\varphi)$ are vanishingly small for the Cβ, Cγ and Cδ carbons,
and for the side-chain N and O atoms in the case of
ASN and ASP. But both $(a_\theta, c_\theta)$ and $(a_\varphi, c_\varphi)$ have amino
acid independent, finite and universal values that can be
directly inferred from the Figures 2 (middle), 4, 7 and 8
respectively,

$$
\langle \theta_i \rangle_{L\text{-}\alpha} \approx \frac{a_\theta}{c_\theta}, \qquad
\langle \varphi_i \rangle_{L\text{-}\alpha} \approx \frac{a_\varphi}{c_\varphi}
$$

We now proceed to argue that these universal values can 
be understood in terms of relatively few DNLS solitons. 
We then show how these solitons can be classified using 
our graphical tools. 
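
The role of the vanishing coefficients can be illustrated with a short sketch. It assumes that a side-chain angle depends on the backbone bond angle through the same ratio as the torsion angle, angle = (a + bκ²)/(c + dκ²); the function name and the numbers are hypothetical. With b = d = 0 the angle equals a/c for every bond angle, which is the kind of residue-independent localization described above, while non-vanishing b and d make the direction follow the backbone.

```python
def side_chain_angle(kappa, a, b, c, d):
    """Spherical angle of a side-chain atom as a function of the bond angle kappa."""
    k2 = kappa * kappa
    return (a + b * k2) / (c + d * k2)


# Three hypothetical bond angle values (radians).
kappas = [0.4, 1.0, 1.57]

# L-alpha-like case: b and d vanish, so the angle is pinned at a/c = 0.6.
l_alpha = [side_chain_angle(k, a=0.9, b=0.0, c=1.5, d=0.0) for k in kappas]

# Generic case: the same atom now moves with the backbone bond angle.
generic = [side_chain_angle(k, a=0.9, b=0.3, c=1.5, d=0.1) for k in kappas]

spread_l_alpha = max(l_alpha) - min(l_alpha)
spread_generic = max(generic) - min(generic)
```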



VI. SOLITON VISUALIZATION 

It has been argued in the literature that in the case of
ASN and ASP the L-α Ramachandran region becomes
stabilized by a local but non-covalent attractive interaction
between the side-chain and backbone carbonyls, with the
backbone oxygen atom in a special role [13], [H]. Unlike
the Ramachandran plot, our Frenet framing can provide
information on the neighboring peptide units, and we have
investigated the directions of all backbone O atoms in our
data set, in a group of residues around the i-th side-chain
Cβ that is located in the L-α island. The result shown
in Figure 10 displays how these O atoms are seen by our
Frenet frame observer who is located at the i-th central
Cα carbon. Our observer finds that the directions of the
nearby backbone O atoms are very strongly localized and
correlated. The localization is residue independent and
extends over at least four different residues:

• For the $i-2$ site there is strong localization
with a three-fold degeneracy that is reminiscent of the
trans/gauche (sp3 hybridization) degeneracy. The data
is consistent with vanishing values of both $(b_\theta, d_\theta)$ and
$(b_\varphi, d_\varphi)$.

• For the site $i-1$ we have very strong localization
along the longitudinal ($\varphi_{i-1}$) direction, with a tiny
oscillation in the latitudinal ($\theta_{i-1}$) direction. For the
corresponding energy, $(b_\varphi, d_\varphi)$ are again vanishingly small
while $(b_\theta, d_\theta)$ are now small but non-vanishing.

• For the site $i$ we have a single localized oscillator in
the longitudinal direction. Thus $(b_\varphi, d_\varphi)$ are now small
but non-vanishing while $(b_\theta, d_\theta)$ vanish.

• For the site $i+1$ we again find the three-fold
trans/gauche degeneracy: There are three oscillators in
the longitudinal direction, and they are all located very
close to the north pole. Consequently $(b_\theta, d_\theta)$ vanish
while $(b_\varphi, d_\varphi)$ do not. In fact, the $\varphi_{i+1}$ amplitudes are
quite large.
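
The observations listed above rest on expressing the direction of an atom in the frame of the observer at the i-th Cα. A minimal sketch of that computation follows. It assumes the usual discrete Frenet frame built from three consecutive Cα positions (tangent t along the next link, binormal b from the cross product of consecutive tangents, normal n = b × t), with the polar axis of the sphere taken along t and the longitude measured from n; these conventions, like the function names, are assumptions made for the example and may differ in detail from the ones used for the figures.

```python
import numpy as np


def _unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def frenet_frame(r_prev, r_here, r_next):
    """Discrete Frenet frame (t, n, b) at r_here from three consecutive C-alpha positions.

    The three points must not be collinear, otherwise b is undefined.
    """
    t_prev = _unit(np.asarray(r_here, dtype=float) - np.asarray(r_prev, dtype=float))
    t = _unit(np.asarray(r_next, dtype=float) - np.asarray(r_here, dtype=float))
    b = _unit(np.cross(t_prev, t))
    n = np.cross(b, t)
    return t, n, b


def direction_angles(atom, r_prev, r_here, r_next):
    """Angles (theta, phi) of an atom as seen from r_here in the local frame.

    theta is the angle from the tangent t (the assumed north pole),
    phi is the longitude measured from n towards b.
    """
    t, n, b = frenet_frame(r_prev, r_here, r_next)
    u = _unit(np.asarray(atom, dtype=float) - np.asarray(r_here, dtype=float))
    theta = np.arccos(np.clip(np.dot(u, t), -1.0, 1.0))
    phi = np.arctan2(np.dot(u, b), np.dot(u, n))
    return theta, phi


# Toy geometry: a right-angle turn in the xy-plane.
ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_b, phi_b = direction_angles((1.0, 0.0, 1.0), *ca)  # atom along the binormal
theta_t, _ = direction_angles((1.0, 2.0, 0.0), *ca)      # atom along the tangent
```

Applying `direction_angles` to the backbone O atoms of the sites i-2 to i+1, always with the frame of site i, would give the kind of point cloud that the text describes.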

The localization pattern of the backbone O atoms
means that for our Frenet frame observer the backbone
geometry around an L-α residue shows very little variation.
Only a very limited set of extended backbone geometries
are accessible. Since the regime that covers the
sites from the $(i-2)$-th to the $(i+1)$-th involves four sets
of bond and torsion angles, each of them defined in terms
of three and four residues respectively, we conclude that the
geometries reflect the non-local collective interplay of at
least seven different residue sites along the backbone.
This is in line with our proposal that the positions of
the side-chain atoms are determined dynamically by the
backbone, in terms of a small number of different DNLS
soliton profiles according to (11), (12).
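
The count of seven sites can be checked with simple index bookkeeping. The sketch below assumes one common indexing convention, in which the bond angle attached to site k is built from the Cα positions k, k+1, k+2 and the torsion angle from the positions k-1, k, k+1, k+2; the function names and the convention are assumptions made for the example. Any convention that uses three and four consecutive Cα atoms gives the same total.

```python
def bond_angle_sites(k):
    """C-alpha sites entering the bond angle at site k (three consecutive sites)."""
    return range(k, k + 3)


def torsion_angle_sites(k):
    """C-alpha sites entering the torsion angle at site k (four consecutive sites)."""
    return range(k - 1, k + 3)


def sites_involved(first, last):
    """All C-alpha sites needed for the angles at sites first .. last."""
    sites = set()
    for k in range(first, last + 1):
        sites.update(bond_angle_sites(k))
        sites.update(torsion_angle_sites(k))
    return sorted(sites)


i = 10  # hypothetical position of the L-alpha residue
span = sites_involved(i - 2, i + 1)
```

Here `span` runs from site i-3 to site i+3, so the four sets of angles involve seven consecutive Cα atoms.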



To expose the soliton structures that surround the L-α
island, we consider the distribution of the backbone
bond and torsion angles that are attached to those Cα
carbons where the Cβ atom is in the L-α position. The
result is shown in Figure 11 on a stereographically projected
two-sphere, separately for ASN and ASP and for
the remaining non-glycyl amino acids.
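
For readers who want to reproduce maps of this kind, the following sketch shows one way to flatten a direction on the two-sphere onto a plane. It assumes a stereographic projection from the north pole onto the equatorial plane, so that the south pole lands at the origin and the equator on the unit circle; the function name is hypothetical, and the choice of pole and of which angle plays the role of latitude is an assumption that should be matched to the figure being reproduced.

```python
import math


def stereographic_from_north(theta, phi):
    """Project the point with colatitude theta and longitude phi onto the plane z = 0.

    The projection is taken from the north pole (theta = 0), which itself
    has no image and is rejected.
    """
    x = math.sin(theta) * math.cos(phi)
    y = math.sin(theta) * math.sin(phi)
    z = math.cos(theta)
    if 1.0 - z < 1e-12:
        raise ValueError("the north pole cannot be projected")
    return x / (1.0 - z), y / (1.0 - z)


south = stereographic_from_north(math.pi, 0.0)         # lands at the origin
equator = stereographic_from_north(math.pi / 2, 0.0)   # lands on the unit circle
```

With this convention a point at colatitude theta is mapped to the radius cot(theta/2), so directions near the south pole are shown in the center of the map and directions near the north pole are pushed far out.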

We observe no practical difference between the
residues. Nor do we find any practical difference between
the various trans and g- positions. Instead, we do observe
the following general pattern: For the backbone Cα-Cα
link that precedes the L-α island, three different regions
on the $(\kappa, \tau)$ plane are probable. These are the regions
that we have denoted by a, b and c respectively in
Figure 12 (top). In this Figure we have combined
all the data that are displayed separately in the parts a,
b, c and d of Figure 11. After the L-α island there are
also three different regions that are probable. We denote
these regions by the letters b, d and e respectively in
Figure 12 (bottom), now combining the data in parts
e, f, g and h of Figure 11. Note that the regions b in the
two parts of Figure 12 practically overlap.

FIG. 10. (Color online) The orientations of backbone O
atoms around the site i that is located in the L-α island, as
seen by our discrete Frenet frame observer at the i-th central
Cα carbon, on a stereographically projected two-sphere (top)
and on the surrounding two-sphere that we have displayed
from two different perspectives (middle, bottom). For the i-th
and $(i-1)$-th atom only one position appears to be available
while the $(i+1)$-th and $(i-2)$-th atoms each have three available
(trans/gauche) positions. The angle $\varphi$ is measured from the
N axis.

FIG. 11. (Color online) The $(\kappa, \tau)$ distributions for backbone
links that are attached to a Cα carbon with Cβ in the L-α
island, on the stereographically projected two-sphere as in Figure
1. We display separately ASN, ASP, and all the rest. In the left
column the Cα carbons in the case where the corresponding
Cγ carbon is in the trans island, on the right those where
the Cγ is in the g- island. First row a), c) is for the link that
precedes either ASN or ASP. Second row b), d) is for the link
preceding any other non-glycyl amino acid. Third row e), g)
is for the link following either ASN or ASP. Fourth row f), h) is
for the link following the others.

FIG. 12. (Color online) The $(\kappa, \tau)$ distributions for all backbone
links that are attached to a residue in the L-α island. On
the top for the link preceding the residue in the L-α island,
on the bottom for the link following the residue in the L-α island.

By inspecting the protein structures in our data set we
conclude that the presence of a residue in the L-α island
causes the following phenomenological selection rules between
the regions displayed in Figure 12, when we roller-coast
along the backbone:

• The region a can only precede regions d and e.

• Both regions b and c can be followed by any of the
three regions b, d and e.

• The residue preceding either a or c is not located in
the L-α island.

• Both the residue preceding and following b can be
located in the L-α island.

• If the two residues following c are both in the L-α
island, the first residue connects c to b and the second
connects b to b.
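
The first two selection rules amount to a small transition table between the region of the link that precedes an L-α residue (a, b or c) and the region of the link that follows it (b, d or e). The sketch below encodes that table; the names `ALLOWED_AFTER` and `is_allowed` are hypothetical, and the rules about which neighboring residues may themselves sit in the L-α island are not modeled.

```python
# Region of the preceding link -> regions allowed for the following link,
# across a residue located in the L-alpha island.
ALLOWED_AFTER = {
    "a": {"d", "e"},
    "b": {"b", "d", "e"},
    "c": {"b", "d", "e"},
}


def is_allowed(preceding, following):
    """True if an L-alpha residue may connect the two regions in this order."""
    return following in ALLOWED_AFTER.get(preceding, set())


# Two consecutive L-alpha residues after region c: c -> b, then b -> b.
double_l_alpha = [("c", "b"), ("b", "b")]
double_ok = all(is_allowed(p, f) for p, f in double_l_alpha)
```

A table of this kind makes it easy to scan a data set for transitions that would violate the rules, for example a link in region a followed across an L-α residue by a link in region b.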



These are the selection rules that limit the available
global topology of the backbone solitons when we pass
the L-α site. Notice that since the region b has the same
bond angle as the standard α-helix region and since the
torsion angles are equal in magnitude but have an opposite
sign, a repeated structure in b is the mirror image of the
standard right-handed α-helix. Consequently this
is truly the region of left-handed α-helices.

These selection rules classify the different possible
DNLS soliton profiles in the presence of the L-α island. In
particular, we have found that there seem to be no more
than four solitons that are particularly common around
a residue with Cβ located in the L-α island. We now
describe these solitons qualitatively using our visual tools,
as transition trajectories between the different regions
that appear in the maps of Figure 1. These transitions
illustrate how our observer turns at the location of each
Cα as she roller-coasts through the soliton. The results
are shown in Figure 13. In each case the pink arrow
corresponds to a site where Cβ is located in the L-α island.
Recall that the bond and torsion angles are link variables:
they connect two Cα carbons. Clearly,
Figure 13 is but an example of a general method to
visually classify solitons in a manner that directly reflects
the geometry of the underlying backbone.

• In Figure 13a, the first residue takes our observer
away from an α-helix region to the region a in Figure
12 top (black arrow). This is followed by a residue in the
L-α island, that takes the observer to the region d in
Figure 12 bottom (pink arrow). Finally, there is a transition
to the β-strand region (black arrow). Consequently this
is a short soliton that takes us from the ground state
which is an α-helix to the other ground state which is a
β-strand.

• The second soliton trajectory shown in Figure 13b
starts from the β-strand region with a residue that takes
the observer into region c in Figure 12 top. The following
residue that is located in the L-α island then causes a
transition into region d in Figure 12 bottom (pink arrow).
This is followed by a transition back to a β-strand region.
Since the initial and final positions are in a β-strand, this
is an example of a soliton that combines the β-strand
with another β-strand.

• The third trajectory that we have described in Figure
13c starts from the β-strand region and proceeds to
region c in Figure 12 top. From there the trajectory
proceeds to region b in Figure 12, with the transition
caused by a residue in the L-α island. This is followed by
a transition to region d and then back to the β-strand
region. This trajectory is also an example of a soliton
that combines the β-strand with another β-strand.

• Finally, the fourth trajectory that is also common
in our data set is the one displayed in Figure 13d. It
is similar to the trajectory described in Figure 13c,
except that now the residue that is located in the L-α island
causes the transition from b to d in Figure 12. This
trajectory is also an example of a soliton that combines
the β-strand with another β-strand.

The remarkable property of solitons c) and d) in Figure
13 is that they have similar overall topology and
differ from each other only by the location of the L-α
residue along the trajectory. It is quite plausible that in some
proteins these two solitons are but two states of an oscillating
discrete "breather" soliton. The ensuing proteins
are presumably unstructured.

FIG. 13. (Color online) Four different soliton trajectories
through a residue in the L-α island that are common in our
data set, on the stereographically projected two-sphere. The
arrow shows how the Frenet frame observer sees the soliton
proceed from a Cα to the next Cα. In each case the pink
line denotes the transition that is caused by the presence of a
residue in the L-α island. The trajectory a describes a soliton
that connects the α-helix region to the β-strand region. The
remaining ones all both start and end in the β-strand region.

Finally, for the purposes of soliton taxonomy we note
that when we analyze the proteins in the version v3.3 Library
of chopped PDB files for representative CATH domains
we find that the propensity of our solitons is largest
in the (mainly-β) CA level classes 2.90, 2.160 where over
5% of all residues are in the L-α island. We also find
that any CA level family has at least 1% of its residues
in the island, except 1.40 where the single representative
with PDB code 1PPR has no residues in the island.



VII. CONCLUSION 

We have developed a new visualization method for proteins.
In the case of backbones our method provides
information about the geometry of neighboring peptide
units. This enables us to go beyond the regime of the
canonical Ramachandran plot which does not contain
information on the neighboring units. As an example of
side chains, we have visually investigated the non-glycyl
residues that are located in the L-α region of the
Ramachandran plot. Independently of the amino acid, we
find that for a discrete Frenet frame observer who roller-coasts
along the backbone Cα carbons the corresponding
side-chain Cβ carbons are always localized in the same
direction, which is clearly different from the direction of
the Cβ carbons in the right-handed region. This universality
in the orientation persists when we investigate the
Cγ and Cδ carbons, and the side chain O and N atoms
in the case of ASN and ASP. The results suggest that
instead of reflecting only a local interaction between a
given backbone unit and its residue, the L-α island is also
associated with a largely residue independent backbone
conformation.

When we proceed to analyze the distribution of those
backbone bond and torsion angles that are associated
with the links that both precede and follow a residue that
is located in the L-α island, we find that independently
of the residue these angles display very similar patterns.
Since the definition of a bond angle takes three Cα carbons
and the definition of a torsion angle takes four, this
prompts us to propose that the geometrical structure
associated with the presence of a residue in the L-α island
is a soliton that reflects the interplay of at least seven
consecutive backbone units. In particular, we have not
been able to pin-point any obvious local reason (charged,
polar, acidic, hydrophobic/philic) to explain the presence
or absence of a residue in the L-α region.

Our approach is based on a novel visualization method
to depict proteins. This method builds on advances
in three-dimensional visualization techniques that have
been developed after Ramachandran presented his plot.
In the course of our analysis we have been able to observe
several systematic patterns including potential anomalies
in the PDB data. The visualization method we have
developed shows promise to become a valuable tool for both
experimental and theoretical protein structure analysis
and fold description, in particular for visually describing
and classifying the backbone solitons and as a complement
to existing side-chain rotamer libraries.